Seattle boasts one of the “hottest” housing markets in the United States; as of July 2018, Seattle had “led the nation in home price gains” for 21 straight months, a streak topped only by Portland in the 1990s and driven by the city’s tech sector and a lack of supply relative to demand (https://bit.ly/2v5UMcn). Given the Seattle housing market’s notoriety for high prices, we were interested in exploring which variables affect housing price in this market. With this goal in mind, we found a public dataset on Kaggle (“House Sales in King County, USA”) offering 21,613 observations across 21 variables. According to Kaggle, it “includes homes sold between May 2014 and May 2015.” Although the dataset doesn’t cover macro-level variables affecting housing price (such as the local job market or Amazon’s presence), it does cover micro-level variables, such as renovations, number of bedrooms, and square feet of living space, that are common to virtually all housing markets in the United States. As a result, our analysis could lay the groundwork for future comparative analysis with other housing markets across the country.
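This report is organized as follows: we first describe the dataset and our cleaning steps, then summarize housing price and the main size variables, and then address four questions in turn: whether houses of different sizes are priced differently, whether houses of different quality are priced differently, whether older houses are priced differently, and which factors influence house price the most.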
As mentioned previously, our dataset contains 21,613 observations across 21 variables. (See below for a readout of the dataset’s structure and variable names.) Variable descriptions are as follows and come from the following link: https://bit.ly/2MsyRFl; an asterisk next to a variable name indicates usage in our analysis:
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7129300520 6414100192 5631500400 2487200875 1954400510 ...
## $ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
For our exploratory data analysis, we ignored “id” and “date” because we did not expect a record identifier or a sale date to bear on price. We also ignored “floors” because it can be considered a proxy for “sqft_living”. “waterfront” and “view” were dropped because the vast majority of properties were coded as “0”. We ignored “sqft_basement” and “sqft_above” because they are components of “sqft_living” (we didn’t want redundancy in our analysis). We also ignored “sqft_living15” and “sqft_lot15” because we were interested only in the attributes of individual houses, not those of their surrounding neighborhoods (although that could make for an interesting follow-up study).
Following these decisions, we cleaned the data accordingly: we dropped “waterfront” and “view”; we subsetted the dataset to include only properties with more than 0 bedrooms and more than 0 bathrooms (we considered properties without them “outliers”); we subsetted the dataset to include only properties with fewer than 30 bedrooms (a count that high is most likely a recording mistake, given how small such houses are in terms of sqft); we dropped “NA” values from the dataset to simplify our analysis (“NA” values are hard to perform operations on); we converted “condition” and “grade” into factor variables because they are effectively ordered categories rather than continuous measurements; and we applied a logarithmic transformation to housing price to make the visualizations easier to read.
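The cleaning steps above could be sketched in base R roughly as follows. This is only an illustration: the small data frame `toy` holds invented values standing in for the Kaggle file, the helper name `clean_housing` is hypothetical, and the code we actually ran may differ in its details.

```r
# Toy stand-in for the raw data; values are invented to exercise each rule.
toy <- data.frame(
  price      = c(221900, 538000, 180000, 604000, NA, 640000),
  bedrooms   = c(3, 3, 0, 4, 3, 33),
  bathrooms  = c(1, 2.25, 1, 3, 2, 1.75),
  waterfront = c(0, 0, 0, 0, 0, 0),
  view       = c(0, 0, 0, 0, 0, 0),
  condition  = c(3, 3, 3, 5, 3, 5),
  grade      = c(7, 7, 6, 7, 8, 7)
)

clean_housing <- function(df) {
  # Drop the two columns that are almost always coded 0.
  df <- df[, !(names(df) %in% c("waterfront", "view"))]
  # Remove rows with missing values.
  df <- na.omit(df)
  # Keep homes with at least one bedroom and bathroom, and fewer than 30 bedrooms.
  df <- df[df$bedrooms > 0 & df$bathrooms > 0 & df$bedrooms < 30, ]
  # Treat the ordinal ratings as factors.
  df$condition <- factor(df$condition)
  df$grade     <- factor(df$grade)
  # Log-transformed price, used for plotting.
  df$log_price <- log(df$price)
  df
}

cleaned <- clean_housing(toy)
print(cleaned)
```

In this toy example, three of the six rows survive: the row with 0 bedrooms, the row with a missing price, and the row with 33 bedrooms are removed.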
Below is a visualization of the points in the dataset by price, plotted with the leaflet library. Note that the data have been divided into unequal bins to provide a better visualization of the distribution of housing price, so please read the legend carefully. More expensive houses tend to be concentrated near the water and the center of the city.
Here instead is a visualization of the observations by property lot sqft. Again, the data have been divided into unequal bins to provide a better visualization of the distribution of lot size, so please read the legend carefully. Our observation follows common sense: the farther one ventures from the city center, the more land there is.
A brief overview of the dataset yields the following observations for housing price: the minimum price is $78,000, while the maximum is $7,700,000 (quite a large range); the mean is $540,198, well above the median of $450,000 (indicating that the distribution is right-skewed, as further indicated by the histogram below); the standard deviation is $367,142; and the variance is 134,792,956,735 (quite large, indicating “that the data points are very spread out from the mean, and from one another”) (https://bit.ly/2MZ1cCn).
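As a rough sketch of how such summary figures can be computed, the snippet below uses a handful of sale prices copied from rows printed elsewhere in this report. Because it uses only seven prices, its numbers will not match the full-dataset figures quoted above; the variable names are ours, chosen for illustration.

```r
# A few sale prices copied from readouts in this report;
# the real analysis would use the full price column instead.
price <- c(221900, 538000, 180000, 604000, 510000, 5350000, 7062500)

price_summary <- c(
  min    = min(price),
  max    = max(price),
  mean   = mean(price),
  median = median(price),
  sd     = sd(price),
  var    = var(price)
)
print(round(price_summary))

# The variance is the square of the standard deviation.
print(isTRUE(all.equal(var(price), sd(price)^2)))

# A mean well above the median is a quick sign of right skew.
print(mean(price) > median(price))
```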
Just for context, the following readouts offer cross sections of Seattle’s most expensive houses; average prices for each condition level; and average prices for each grade level.
From slicing the data, it looks like the most expensive houses are very well constructed, have tens of thousands of square feet of property, and have 5 or more bedrooms.
## id date price bedrooms bathrooms sqft_living
## 1 8907500070 20150413T000000 5350000 5 5.00 8000
## 2 9808700762 20140611T000000 7062500 5 4.50 10040
## 3 2470100110 20140804T000000 5570000 5 5.75 9200
## 4 6762700020 20141013T000000 7700000 6 8.00 12050
## 5 9208900037 20140919T000000 6885000 6 7.75 9890
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 23985 2.0 3 12 6720 1280 2009
## 2 37325 2.0 3 11 7680 2360 1940
## 3 35069 2.0 3 13 6200 3000 2001
## 4 27600 2.5 4 13 8570 3480 1910
## 5 31374 2.0 3 13 8860 1030 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98004 47.6 -122 4600 21750
## 2 2001 98004 47.6 -122 3930 25449
## 3 0 98039 47.6 -122 3560 24345
## 4 1987 98102 47.6 -122 3940 8800
## 5 0 98039 47.6 -122 4540 42730
## id date price bedrooms bathrooms sqft_living
## 1 6762700020 20141013T000000 7700000 6 8.00 12050
## 2 9808700762 20140611T000000 7062500 5 4.50 10040
## 3 9208900037 20140919T000000 6885000 6 7.75 9890
## 4 2470100110 20140804T000000 5570000 5 5.75 9200
## 5 8907500070 20150413T000000 5350000 5 5.00 8000
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 27600 2.5 4 13 8570 3480 1910
## 2 37325 2.0 3 11 7680 2360 1940
## 3 31374 2.0 3 13 8860 1030 2001
## 4 35069 2.0 3 13 6200 3000 2001
## 5 23985 2.0 3 12 6720 1280 2009
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 1987 98102 47.6 -122 3940 8800
## 2 2001 98004 47.6 -122 3930 25449
## 3 0 98039 47.6 -122 4540 42730
## 4 0 98039 47.6 -122 3560 24345
## 5 0 98004 47.6 -122 4600 21750
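The two readouts above appear to show the same five houses, first in dataset order and then sorted by price. A minimal sketch of that kind of slicing, using a few columns copied from those rows, might look like this; the helper name `top_n_by` is hypothetical and not part of our original code.

```r
# Five rows copied from the readout above (only a few columns kept).
houses <- data.frame(
  id          = c(8907500070, 9808700762, 2470100110, 6762700020, 9208900037),
  price       = c(5350000, 7062500, 5570000, 7700000, 6885000),
  bedrooms    = c(5, 5, 5, 6, 6),
  sqft_living = c(8000, 10040, 9200, 12050, 9890)
)

# Sort by the chosen column, largest first, and keep the top n rows.
top_n_by <- function(df, column, n = 5) {
  head(df[order(df[[column]], decreasing = TRUE), ], n)
}

top3 <- top_n_by(houses, "price", n = 3)
print(top3)
```

The same helper can be pointed at “sqft_living”, “sqft_lot”, or “bedrooms” to produce the other slices used in this report.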
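From the slice below, average prices broadly trend upward with condition, though not at every step (condition 2 averages less than condition 1, and condition 4 less than condition 3); all averages are in the hundreds of thousands of dollars.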
## condition price
## 1 1 341067
## 2 2 328149
## 3 3 542089
## 4 4 521274
## 5 5 612402
From the slice below, we can see that price generally trends upward along with grade.
## grade price
## 1 3 262000
## 2 4 212002
## 3 5 248524
## 4 6 301920
## 5 7 402566
## 6 8 542944
## 7 9 773513
## 8 10 1071771
## 9 11 1496842
## 10 12 2201285
## 11 13 3709615
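Group averages like the two tables above can be produced with `aggregate()`. The sketch below uses invented prices so that it runs on its own; it illustrates the technique and is not necessarily the exact call we used.

```r
# Toy rows (invented values) standing in for the cleaned data.
toy <- data.frame(
  price     = c(200000, 300000, 400000, 500000, 600000, 800000),
  condition = factor(c(3, 3, 4, 4, 5, 5))
)

# Mean price within each condition level.
avg_by_condition <- aggregate(price ~ condition, data = toy, FUN = mean)
print(avg_by_condition)
```

Replacing `condition` with `grade` in the formula gives the corresponding table by grade.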
Below we have included histograms for “bedrooms”, “sqft_living”, and “sqft_lot”. Upon inspecting the graphs, it becomes clear that most of the properties in this dataset have around 3 bedrooms, while the majority of properties are around 1000-2000 square feet (for reference, in 2015, the average US house size was around 2,600 square feet (https://bit.ly/32zY9Hi)); as for sqft_lot, most of the properties have between 5,000 and 10,000 square feet of land. As with the housing price histogram shown earlier, these histograms are right-skewed.
Going into this project, we hypothesized that larger houses would be priced higher than smaller houses. House size is determined in large part by “sqft_living”, of which “bathrooms” and “bedrooms” are a part.
It is apparent here that the largest houses are also among the most expensive: all five are priced in the millions of dollars, far above the dataset mean.
## id date price bedrooms bathrooms sqft_living
## 1 9808700762 20140611T000000 7062500 5 4.50 10040
## 2 6762700020 20141013T000000 7700000 6 8.00 12050
## 3 1924059029 20140617T000000 4668000 5 6.75 9640
## 4 9208900037 20140919T000000 6885000 6 7.75 9890
## 5 1225069038 20140505T000000 2280000 7 8.00 13540
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 37325 2.0 3 11 7680 2360 1940
## 2 27600 2.5 4 13 8570 3480 1910
## 3 13068 1.0 3 12 4820 4820 1983
## 4 31374 2.0 3 13 8860 1030 2001
## 5 307752 3.0 3 12 9410 4130 1999
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 2001 98004 47.6 -122 3930 25449
## 2 1987 98102 47.6 -122 3940 8800
## 3 2009 98040 47.6 -122 3270 10454
## 4 0 98039 47.6 -122 4540 42730
## 5 0 98053 47.7 -122 4850 217800
## id date price bedrooms bathrooms sqft_living
## 1 1225069038 20140505T000000 2280000 7 8.00 13540
## 2 6762700020 20141013T000000 7700000 6 8.00 12050
## 3 9808700762 20140611T000000 7062500 5 4.50 10040
## 4 9208900037 20140919T000000 6885000 6 7.75 9890
## 5 1924059029 20140617T000000 4668000 5 6.75 9640
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 307752 3.0 3 12 9410 4130 1999
## 2 27600 2.5 4 13 8570 3480 1910
## 3 37325 2.0 3 11 7680 2360 1940
## 4 31374 2.0 3 13 8860 1030 2001
## 5 13068 1.0 3 12 4820 4820 1983
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98053 47.7 -122 4850 217800
## 2 1987 98102 47.6 -122 3940 8800
## 3 2001 98004 47.6 -122 3930 25449
## 4 0 98039 47.6 -122 4540 42730
## 5 2009 98040 47.6 -122 3270 10454
The properties with the most land are mostly priced at or above the dataset mean (although one sold for only $190,000), but none comes close to the prices listed in the “sqft_living” readout. This could suggest a lower correlation between housing price and sqft_lot than between housing price and sqft_living.
## id date price bedrooms bathrooms sqft_living
## 1 1020069017 20150327T000000 700000 4 1.00 1300
## 2 722069232 20140905T000000 998000 4 3.25 3770
## 3 2623069031 20140521T000000 542500 5 3.25 3010
## 4 2323089009 20150119T000000 855000 4 3.50 4030
## 5 3326079016 20150504T000000 190000 2 1.00 710
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 1651359 1.0 4 6 1300 0 1920
## 2 982998 2.0 3 10 3770 0 1992
## 3 1074218 1.5 5 8 2010 1000 1931
## 4 1024068 2.0 3 10 4030 0 2006
## 5 1164794 1.0 2 5 710 0 1915
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98022 47.2 -122 2560 425581
## 2 0 98058 47.4 -122 2290 37141
## 3 0 98027 47.5 -122 2450 68825
## 4 0 98045 47.5 -122 1830 11700
## 5 0 98014 47.7 -122 1680 16730
## id date price bedrooms bathrooms sqft_living
## 1 1020069017 20150327T000000 700000 4 1.00 1300
## 2 3326079016 20150504T000000 190000 2 1.00 710
## 3 2623069031 20140521T000000 542500 5 3.25 3010
## 4 2323089009 20150119T000000 855000 4 3.50 4030
## 5 722069232 20140905T000000 998000 4 3.25 3770
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 1651359 1.0 4 6 1300 0 1920
## 2 1164794 1.0 2 5 710 0 1915
## 3 1074218 1.5 5 8 2010 1000 1931
## 4 1024068 2.0 3 10 4030 0 2006
## 5 982998 2.0 3 10 3770 0 1992
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98022 47.2 -122 2560 425581
## 2 0 98014 47.7 -122 1680 16730
## 3 0 98027 47.5 -122 2450 68825
## 4 0 98045 47.5 -122 1830 11700
## 5 0 98058 47.4 -122 2290 37141
The properties with the largest number of bedrooms are also priced highly (around or above the dataset mean of $540,000), but not quite as highly as those in the “sqft_living” readout. Since “bedrooms” contributes in part – but not in whole – to sqft_living, it makes sense that its correlation with housing price is lower than that of sqft_living.
## id date price bedrooms bathrooms sqft_living
## 1 1773100755 20140821T000000 520000 11 3.00 3000
## 2 627300145 20140814T000000 1148000 10 5.25 4590
## 3 5566100170 20141029T000000 650000 10 2.00 3610
## 4 8812401450 20141229T000000 660000 10 3.00 2920
## sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1 4960 2 3 7 2400 600 1918
## 2 10920 1 3 9 2500 2090 2008
## 3 11914 2 4 7 3010 600 1958
## 4 3745 2 4 7 1860 1060 1913
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 1999 98106 47.6 -122 1420 4960
## 2 0 98004 47.6 -122 2730 10400
## 3 0 98006 47.6 -122 2040 11914
## 4 0 98105 47.7 -122 1810 3745
Are houses of different sizes priced differently?
Now that we’ve taken a look at slices of the data, we can delve deeper with some graphs. Below are a scatterplot and a boxplot of housing price vs. “sqft_living”.
From the scatterplot, it’s apparent that there is a relatively strong, positive correlation between housing price and living space (r = 0.70192). That is, as living space increases, housing price tends to increase as well. Note that a majority of the data points lie below 6,000 sqft and below $2 million.
Now let’s take a look at the same data with a boxplot; this time, we have “sqft_living” categorized into 5 intervals. From this visualization as well, it’s apparent that “sqft_living” correlates positively with housing price. The last interval (10,891-13,540 sqft) seems to buck this trend, but it is worth noting that only 2 houses are part of this group, a sample too small to draw conclusions from.
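One way to form such intervals is `cut()` with a fixed number of equal-width breaks. The sketch below assumes equal-width binning (our exact break points may have been set differently) and uses living areas copied from rows shown in this report.

```r
# Living areas copied from rows shown in this report.
sqft_living <- c(1180, 2570, 770, 1960, 1680, 5420, 8000, 10040, 12050, 13540)

# Split the observed range into 5 equal-width intervals.
living_bin <- cut(sqft_living, breaks = 5, dig.lab = 5)

# Count how many houses fall in each interval; the top bins are sparse.
bin_counts <- table(living_bin)
print(bin_counts)
```

Checking the counts per interval before reading a boxplot helps flag groups, like the largest interval above, that contain too few houses to interpret.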
Next up in our exploratory data analysis is housing price vs. “sqft_lot”. How does land area correlate with housing price? According to our scatterplot, not very highly: there is a positive correlation of only 0.08988. This suggests that sqft_lot is much more weakly related to housing price than sqft_living is. Indeed, the vast majority of data points in the scatterplot trend upward in price over relatively small increases in land area.
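The correlation coefficients quoted for “sqft_living” and “sqft_lot” are Pearson correlations, which `cor()` computes by default. The sketch below uses only the first five rows of the dataset (copied from the structure readout), so its values differ from the full-dataset figures; it is meant only to show the call.

```r
# First five rows of the dataset, copied from the structure readout.
houses <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000),
  sqft_living = c(1180, 2570, 770, 1960, 1680),
  sqft_lot    = c(5650, 7242, 10000, 5000, 8080)
)

# Pearson correlation of price with each size measure.
r_living <- cor(houses$price, houses$sqft_living)
r_lot    <- cor(houses$price, houses$sqft_lot)
print(c(sqft_living = r_living, sqft_lot = r_lot))
```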
Next, let’s take a look at the same data in a boxplot. Unfortunately, the visualization isn’t very readable, so let’s apply a logarithmic transformation to housing price to improve our y-axis scale.
The modified boxplot below (with the logarithmic scale) is much easier to interpret. We can see that housing price increases as land area increases, but only to an extent. Note that houses in the 991,000-1.3M sqft and 1.3M-1.65M sqft ranges appear to buck the trend. Once again, this can be explained by the fact that only a few houses are part of these two intervals – only 4 to be exact.
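To illustrate why a logarithmic scale makes right-skewed prices easier to read, the sketch below compares a simple moment-based skewness before and after taking logs. The prices are a small sample copied from readouts in this report, and the `skewness` helper is our own illustrative definition rather than a function from a package.

```r
# A few sale prices copied from readouts in this report.
price <- c(221900, 538000, 180000, 604000, 510000, 5350000, 7062500)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(price)
skew_log <- skewness(log(price))
print(c(raw = skew_raw, log = skew_log))
```

For this sample, the skewness of the logged prices is smaller than that of the raw prices, which is the effect the log-scale boxplots rely on.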
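The readouts below report a studentized Breusch-Pagan test, a one-way ANOVA, and Tukey’s pairwise comparisons for housing price across the five “sqft_lot” intervals. The Breusch-Pagan test gives no evidence of unequal variances across the intervals (BP = 0.2, df = 4, p-value = 1). The ANOVA indicates that mean price differs across intervals (F = 7.24, p-value = 0.0000081), and the Tukey comparisons attribute this to the 660K-991K interval, whose mean price is significantly higher than that of the two smaller-lot intervals (adjusted p-values of 0.000 and 0.005); no other pair differs significantly at the 5% level.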
##
## studentized Breusch-Pagan test
##
## data: kc_house_data$price ~ sqft.lot
## BP = 0.2, df = 4, p-value = 1
## Df Sum Sq Mean Sq F value Pr(>F)
## sqft.lot 4 3896930121642 974232530411 7.24 0.0000081 ***
## Residuals 21591 2906956970577216 134637440164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = kc_house_data$price ~ sqft.lot)
##
## $sqft.lot
## diff lwr upr p adj
## 330K-660K-520-330K 130987 -15181 277155 0.104
## 660K-991K-520-330K 619510 265542 973477 0.000
## 991K-1.3M-520-330K -10511 -588471 567449 1.000
## 1.3M-1.65M-520-330K 160322 -840687 1161331 0.992
## 660K-991K-330K-660K 488523 105684 871361 0.005
## 991K-1.3M-330K-660K -141498 -737577 454580 0.967
## 1.3M-1.65M-330K-660K 29335 -982244 1040914 1.000
## 991K-1.3M-660K-991K -630021 -1307691 47650 0.083
## 1.3M-1.65M-660K-991K -459187 -1520893 602518 0.763
## 1.3M-1.65M-991K-1.3M 170833 -985006 1326672 0.994
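The three readouts above follow a common sequence: test for unequal variances, run a one-way ANOVA, then compare groups pairwise. The sketch below reproduces that sequence in base R on simulated data. The Breusch-Pagan statistic is computed by hand in its studentized (Koenker) form, on the assumption that this matches the test labelled “studentized Breusch-Pagan test” above; the group names and prices are invented.

```r
set.seed(1)
# Simulated data: three groups with different mean prices (purely illustrative).
group <- factor(rep(c("small", "medium", "large"), each = 30),
                levels = c("small", "medium", "large"))
price <- c(rnorm(30, 300000, 50000),
           rnorm(30, 450000, 50000),
           rnorm(30, 700000, 50000))

fit <- lm(price ~ group)

# Studentized Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the same predictors; df = number of slope terms.
aux     <- lm(resid(fit)^2 ~ group)
bp_stat <- length(price) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p = bp_p))

# One-way ANOVA followed by Tukey's pairwise comparisons.
anova_fit <- aov(price ~ group)
tukey     <- TukeyHSD(anova_fit)
print(summary(anova_fit))
print(tukey)
```

A large Breusch-Pagan p-value is consistent with equal variances across groups, which is one of the assumptions behind the ANOVA that follows.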
Here, we have a logarithmic box plot of housing price vs. “bedrooms”. There appears to be a clear trend: as the number of bedrooms increases, housing price increases as well. The 9-11 interval bucks the trend slightly, but again, this can be explained by the fact that only 10 houses are part of this interval, compared with 21586 total for the others.
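The studentized Breusch-Pagan test below checks whether the variance of housing price is constant across the bedroom intervals. It rejects that hypothesis (BP = 202, df = 3, p-value < 0.0000000000000002), so the equal-variance assumption behind a one-way ANOVA does not hold for this grouping.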
##
## studentized Breusch-Pagan test
##
## data: kc_house_data$price ~ number.bedrooms
## BP = 202, df = 3, p-value <0.0000000000000002
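We can also use Pearson’s chi-squared test to ask whether more rooms go along with higher prices. The null hypothesis is that price category is independent of room count. The two readouts below test the tables “bed_p” and “bath_p” (judging by their names, cross-tabulations of price against bedrooms and bathrooms, respectively). Both tests reject independence (X-squared = 2287 and 4534, df = 21, p-value < 0.0000000000000002), although R warns that the chi-squared approximation may be incorrect, which happens when some expected cell counts are small.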
## Warning in chisq.test(bed_p): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: bed_p
## X-squared = 2287, df = 21, p-value <0.0000000000000002
## Warning in chisq.test(bath_p): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: bath_p
## X-squared = 4534, df = 21, p-value <0.0000000000000002
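As a minimal sketch of the chi-squared test of independence used above, the snippet below builds a small contingency table and passes it to `chisq.test()`. The counts and the band labels are invented for illustration; they are not the tables behind the readouts.

```r
# Invented counts: rows are price bands, columns are bedroom bands.
bed_price <- matrix(
  c(60, 30, 10,
    30, 40, 30,
    10, 30, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(price    = c("low", "mid", "high"),
                  bedrooms = c("1-2", "3", "4+"))
)

test <- chisq.test(bed_price)

# Degrees of freedom are (rows - 1) * (columns - 1).
print(test$parameter)
# Expected counts under independence; values below 5 trigger R's warning
# that the chi-squared approximation may be incorrect.
print(test$expected)
print(test$p.value)
```

Inspecting `test$expected` is a quick way to see which cells are responsible when that warning appears.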
Are houses of different quality priced differently?
Here, we compare “condition” with housing price. Once again, “condition” represents an index from 1 to 5, with the lowest number representing poor condition. Looking at the boxplots below (the second uses a logarithmic scale for clearer visualization), it becomes clear that house condition correlates positively with housing price.
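The readouts below report a studentized Breusch-Pagan test, a one-way ANOVA, and Tukey’s pairwise comparisons for housing price across the five “condition” levels. The Breusch-Pagan test gives no evidence of unequal variances (BP = 5, df = 4, p-value = 0.3), and the ANOVA indicates that mean price differs across condition levels (F = 36.9, p-value < 0.0000000000000002). In the Tukey comparisons, every pair differs significantly at the 5% level except 2-1 and 4-1; note that condition 4 averages about $20,815 less than condition 3, so the relationship is not strictly increasing.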
##
## studentized Breusch-Pagan test
##
## data: price ~ condition
## BP = 5, df = 4, p-value = 0.3
## Df Sum Sq Mean Sq F value
## kc_house_data$condition 4 19739857713567 4934964428392 36.9
## Residuals 21591 2891114042985296 133903665554
## Pr(>F)
## kc_house_data$condition <0.0000000000000002 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = kc_house_data$price ~ kc_house_data$condition)
##
## $`kc_house_data$condition`
## diff lwr upr p adj
## 2-1 -12918 -213478 187642 1.000
## 3-1 201022 15459 386585 0.026
## 4-1 180207 -5637 366051 0.062
## 5-1 271335 84389 458280 0.001
## 3-2 213940 136914 290965 0.000
## 4-2 193125 115424 270825 0.000
## 5-2 284253 203953 364552 0.000
## 4-3 -20815 -36519 -5111 0.003
## 5-3 70313 44676 95950 0.000
## 5-4 91128 63529 118727 0.000
Here we have a boxplot comparing “grade” with housing price. Once again, “grade” represents an index from 1 to 13, with the lowest number representing poor construction and design. The trend is clear: construction and design grade correlates positively with housing price.
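The chi-squared test below is run on the table “cond.tbl”; its 40 degrees of freedom are consistent with a cross-tabulation of the 5 condition levels against the 11 grade levels. The test rejects independence (X-squared = 1457, df = 40, p-value < 0.0000000000000002), again with a warning that some expected cell counts are small.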
## Warning in chisq.test(cond.tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cond.tbl
## X-squared = 1457, df = 40, p-value <0.0000000000000002
Are older houses priced differently?
Here, we have a comparison of “yr_built” with housing price. Looking at the logarithmic boxplot, we see no single consistent trend: housing price trends downward from 1900-1969, and then picks back up from 1970-2015. What might explain this? Well, “yr_built” does not take “yr_renovated” into account. For instance, two otherwise equivalent houses built in the same year could have different prices, depending on whether one has been renovated while the other hasn’t.
Let’s construct the same boxplots, but this time indexed by “yr_renovated”.
Unfortunately, the vast majority of properties in this dataset have never been renovated (20682 to be exact). This means that only 914 properties have been renovated. This makes the resulting boxplots somewhat uninformative – the larger population boxplot (not renovated) largely mirrors the patterns of the previous graph, and the smaller population graph (renovated) is based on a population too small to run meaningful analysis on. We have included the graphs here to showcase our thought process, but we are well aware of their limitations.
Here, we’ve graphed housing price by yr_renovated itself. This graph also showcases only 914 properties – the ones that were renovated. Generally speaking, as “yr_renovated” approaches the present day, price increases. The exception is between the 1924-1946 and 1947-1969 intervals; note however, that only 9 properties occupy the first interval.
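A studentized Breusch-Pagan test of housing price against the “yr_built” intervals (readout below) rejects equal variances (BP = 41, df = 4, p-value = 0.00000003), so we did not run an ANOVA for this variable.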
##
## studentized Breusch-Pagan test
##
## data: kc_house_data$price ~ year.built
## BP = 41, df = 4, p-value = 0.00000003
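We used chi-squared tests to assess whether house age is independent of quality. The readouts below test the tables “gradey.tbl” and “condy.tbl” (judging by their names and degrees of freedom, cross-tabulations of the year-built intervals against grade and condition, respectively). Both tests reject independence (X-squared = 6404, df = 40, and X-squared = 4844, df = 16; p-value < 0.0000000000000002 in both cases), again with the warning that some expected cell counts are small.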
## Warning in chisq.test(gradey.tbl): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: gradey.tbl
## X-squared = 6404, df = 40, p-value <0.0000000000000002
## Warning in chisq.test(condy.tbl): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: condy.tbl
## X-squared = 4844, df = 16, p-value <0.0000000000000002
What factors influence the house price the most?
Below is our regression model, along with a comprehensive correlation plot.
yr_built and sqft_lot seem unrelated to price, as their correlation coefficients with price are close to 0; accordingly, we do not include them as independent variables to predict house price.
##
## Call:
## lm(formula = price ~ . - sqft_lot - yr_built, data = h2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1706400 -142615 -21697 101727 4133425
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80549.20 7851.17 10.26 < 0.0000000000000002 ***
## bedrooms -64528.79 2446.40 -26.38 < 0.0000000000000002 ***
## bathrooms 5499.45 3852.17 1.43 0.15341
## sqft_living 340.68 4.99 68.33 < 0.0000000000000002 ***
## floors 14875.02 4299.43 3.46 0.00054 ***
## sqft_above -36.54 5.03 -7.26 0.0000000000004 ***
## sqft_basement NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257000 on 21590 degrees of freedom
## Multiple R-squared: 0.51, Adjusted R-squared: 0.509
## F-statistic: 4.49e+03 on 5 and 21590 DF, p-value: <0.0000000000000002
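The coefficient of “sqft_basement” is NA (“not defined because of singularities”) because it is an exact linear combination of other predictors: “sqft_living” equals “sqft_above” plus “sqft_basement”. We therefore dropped “sqft_basement”. The p-value of “bathrooms” (0.153) is too large for the variable to be considered significant, so we dropped it as well.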
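The singularity can be checked directly on the first ten rows of the dataset, copied from the structure readout near the top of this report. In the sketch below the response `y` is simulated, purely to show how `lm()` reports a predictor that is an exact linear combination of others; it is not our actual price model.

```r
# First ten rows of three columns, copied from the structure readout.
sqft_living   <- c(1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890)
sqft_above    <- c(1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 1890)
sqft_basement <- c(0, 400, 0, 910, 0, 1530, 0, 0, 730, 0)

# Living area is exactly above-ground area plus basement area.
print(all(sqft_living == sqft_above + sqft_basement))

# Made-up response, only to show how lm() reacts to the exact dependency.
set.seed(42)
y <- 100 * sqft_living + rnorm(10, sd = 1000)

fit <- lm(y ~ sqft_living + sqft_above + sqft_basement)
# The last of the three collinear terms is reported as NA.
print(coef(fit))
```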
##
## Call:
## lm(formula = price ~ . - sqft_lot - yr_built - sqft_basement -
## bathrooms, data = h2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1711175 -142619 -21684 101736 4134675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81162.16 7839.61 10.35 < 0.0000000000000002 ***
## bedrooms -63928.53 2410.05 -26.53 < 0.0000000000000002 ***
## sqft_living 343.98 4.42 77.89 < 0.0000000000000002 ***
## floors 17336.10 3938.78 4.40 0.000010807578329 ***
## sqft_above -37.41 5.00 -7.49 0.000000000000074 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257000 on 21591 degrees of freedom
## Multiple R-squared: 0.51, Adjusted R-squared: 0.509
## F-statistic: 5.61e+03 on 4 and 21591 DF, p-value: <0.0000000000000002
## bedrooms sqft_living floors sqft_above
## 1.55 5.37 1.48 5.59
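The four values printed above are variance inflation factors (VIFs). As a sketch of what a VIF measures, the snippet below computes it by hand for simulated numeric predictors; the data and the name `vif_manual` are illustrative assumptions, and a packaged VIF function may treat factor variables differently.

```r
# Simulated predictors (illustrative only): x2 is built to track x1 closely.
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Predictors that can be largely explained by the others (here x1 and x2) get a high VIF, while an unrelated predictor (x3) stays close to 1.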
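The refitted model looks better; we also checked the VIF value of each variable, and none exceeds 10, a commonly used rule-of-thumb threshold, so we found no sign of severe multicollinearity (although “sqft_living” and “sqft_above”, at 5.37 and 5.59, are moderately collinear). We then added the two factor variables (“grade” and “condition”) into the dataset to see their effects.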
## price bedrooms sqft_living sqft_above
## Min. : 78000 Min. : 1.00 Min. : 370 Min. : 370
## 1st Qu.: 322000 1st Qu.: 3.00 1st Qu.: 1430 1st Qu.:1190
## Median : 450000 Median : 3.00 Median : 1910 Median :1560
## Mean : 540198 Mean : 3.37 Mean : 2080 Mean :1789
## 3rd Qu.: 645000 3rd Qu.: 4.00 3rd Qu.: 2550 3rd Qu.:2210
## Max. :7700000 Max. :11.00 Max. :13540 Max. :9410
##
## floors grade condition
## Min. :1.00 7 :8973 1: 29
## 1st Qu.:1.00 8 :6065 2: 170
## Median :1.50 9 :2615 3:14020
## Mean :1.49 6 :2038 4: 5677
## 3rd Qu.:2.00 10 :1134 5: 1700
## Max. :3.50 11 : 399
## (Other): 372
##
## Call:
## lm(formula = price ~ ., data = h3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1571904 -122108 -22410 88448 4645573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119945.02 235012.20 0.51 0.60979
## bedrooms -27981.05 2255.58 -12.41 < 0.0000000000000002 ***
## sqft_living 229.90 4.37 52.66 < 0.0000000000000002 ***
## sqft_above -89.75 4.65 -19.29 < 0.0000000000000002 ***
## floors 25574.69 3798.02 6.73 0.000000000017 ***
## grade4 52746.97 235247.78 0.22 0.82259
## grade5 47680.72 231471.19 0.21 0.83680
## grade6 77263.78 231054.85 0.33 0.73808
## grade7 110396.50 231036.91 0.48 0.63278
## grade8 183631.54 231071.92 0.79 0.42680
## grade9 327826.56 231145.29 1.42 0.15613
## grade10 530666.16 231258.19 2.29 0.02176 *
## grade11 829031.75 231551.46 3.58 0.00034 ***
## grade12 1357863.56 232717.34 5.83 0.000000005462 ***
## grade13 2552196.47 240535.18 10.61 < 0.0000000000000002 ***
## condition2 -60791.74 46520.27 -1.31 0.19130
## condition3 -62199.90 43243.71 -1.44 0.15035
## condition4 -5587.43 43281.73 -0.13 0.89728
## condition5 71583.51 43532.93 1.64 0.10012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 231000 on 21577 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.604
## F-statistic: 1.83e+03 on 18 and 21577 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above floors grade4 grade5
## 1.68 6.51 6.01 1.70 27.99 240.44
## grade6 grade7 grade8 grade9 grade10 grade11
## 1847.93 5250.37 4367.68 2302.97 1077.66 393.79
## grade12 grade13 condition2 condition3 condition4 condition5
## 90.02 14.10 6.85 172.49 147.02 55.66
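None of the four “condition” coefficients (levels 2 to 5, each measured against the baseline, condition 1) is significant, so we can drop the “condition” variable. For the “grade” variable, the higher grade levels (10 and above) have significant effects on price relative to the baseline grade. By contrast, the lower grade levels do not differ significantly from the baseline.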
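The reason a five-level factor yields only four coefficients is that R’s default coding turns a factor into indicator columns for every level except the first, which is absorbed into the intercept. A minimal sketch with a toy data frame:

```r
# Five invented houses, one per condition level.
toy <- data.frame(condition = factor(1:5))

# The design matrix has an intercept plus one indicator column per
# non-baseline level; condition 1 is absorbed into the intercept.
design <- model.matrix(~ condition, data = toy)
print(colnames(design))
```

Each coefficient such as “condition3” or “grade10” is therefore a difference from the baseline level, not an absolute effect.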
##
## Call:
## lm(formula = price ~ . - condition, data = h3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1594879 -122654 -26548 89616 4612348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203437.68 234053.01 0.87 0.3848
## bedrooms -25531.89 2283.29 -11.18 < 0.0000000000000002 ***
## sqft_living 241.42 4.40 54.90 < 0.0000000000000002 ***
## sqft_above -101.49 4.69 -21.65 < 0.0000000000000002 ***
## floors 11331.97 3765.10 3.01 0.0026 **
## grade4 -58531.64 238316.99 -0.25 0.8060
## grade5 -47771.75 234518.50 -0.20 0.8386
## grade6 -24710.22 234099.87 -0.11 0.9159
## grade7 2597.97 234076.33 0.01 0.9911
## grade8 71562.72 234108.48 0.31 0.7598
## grade9 212286.01 234180.94 0.91 0.3647
## grade10 412631.72 234294.46 1.76 0.0782 .
## grade11 707325.79 234587.29 3.02 0.0026 **
## grade12 1233712.59 235767.03 5.23 0.00000017 ***
## grade13 2416245.65 243679.96 9.92 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 234000 on 21581 degrees of freedom
## Multiple R-squared: 0.594, Adjusted R-squared: 0.594
## F-statistic: 2.26e+03 on 14 and 21581 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above floors grade4 grade5
## 1.68 6.43 5.94 1.63 27.97 240.31
## grade6 grade7 grade8 grade9 grade10 grade11
## 1846.94 5247.31 4365.02 2301.52 1076.98 393.53
## grade12 grade13
## 89.96 14.09
Next, we added interaction terms to the model, since we wanted to see whether the effect of one variable on price depends on the value of another. We first put all pairwise interactions among the numeric variables into the model to see what would happen.
##
## Call:
## lm(formula = price ~ . + bedrooms:sqft_living + bedrooms:floors +
## bedrooms:sqft_above + sqft_living:floors + sqft_living:sqft_above +
## floors:sqft_above, data = h4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3695612 -119737 -25450 86085 3346184
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 245214.63848 229404.26957 1.07
## bedrooms -51836.57174 7294.93923 -7.11
## sqft_living 86.13585 16.53152 5.21
## sqft_above -19.70424 21.95986 -0.90
## floors 36232.45799 14886.82743 2.43
## grade4 -45196.54696 233098.37846 -0.19
## grade5 -15410.52348 229421.70528 -0.07
## grade6 21127.44809 229036.61201 0.09
## grade7 74426.60451 229048.12877 0.32
## grade8 160884.19715 229093.63765 0.70
## grade9 310355.78334 229162.33811 1.35
## grade10 492220.91881 229263.64399 2.15
## grade11 721454.84294 229542.28824 3.14
## grade12 1105961.32670 230837.39838 4.79
## grade13 1837445.26776 239796.69406 7.66
## bedrooms:sqft_living -16.02514 3.80857 -4.21
## bedrooms:floors 25728.79001 5273.85824 4.88
## bedrooms:sqft_above 20.28226 5.16049 3.93
## sqft_living:floors 101.59363 8.63259 11.77
## sqft_living:sqft_above 0.03473 0.00195 17.79
## sqft_above:floors -177.53444 9.78398 -18.15
## Pr(>|t|)
## (Intercept) 0.2851
## bedrooms 0.000000000001233 ***
## sqft_living 0.000000190166631 ***
## sqft_above 0.3696
## floors 0.0149 *
## grade4 0.8463
## grade5 0.9464
## grade6 0.9265
## grade7 0.7452
## grade8 0.4825
## grade9 0.1757
## grade10 0.0318 *
## grade11 0.0017 **
## grade12 0.000001669853101 ***
## grade13 0.000000000000019 ***
## bedrooms:sqft_living 0.000025908368174 ***
## bedrooms:floors 0.000001076292883 ***
## bedrooms:sqft_above 0.000085104651817 ***
## sqft_living:floors < 0.0000000000000002 ***
## sqft_living:sqft_above < 0.0000000000000002 ***
## sqft_above:floors < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229000 on 21575 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: 0.611
## F-statistic: 1.7e+03 on 20 and 21575 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above
## 17.9 95.0 136.2
## floors grade4 grade5
## 26.6 28.0 240.4
## grade6 grade7 grade8
## 1848.1 5252.3 4369.7
## grade9 grade10 grade11
## 2304.0 1078.0 393.9
## grade12 grade13 bedrooms:sqft_living
## 90.2 14.3 144.4
## bedrooms:floors bedrooms:sqft_above sqft_living:floors
## 70.0 194.0 150.2
## sqft_living:sqft_above sqft_above:floors
## 32.1 173.0
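We dropped the insignificant interactions, as well as the interactions that caused other variables to become insignificant (for example, “sqft_above” has a p-value of 0.37 in the model above). Two interactions remain, “bedrooms:sqft_above” and “sqft_living:sqft_above”; in the resulting model every numeric term is significant, and the adjusted R-squared is 0.605.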
##
## Call:
## lm(formula = price ~ . + bedrooms:sqft_above + sqft_living:sqft_above,
## data = h4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3552516 -119684 -27362 86763 3536811
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 301536.38692 230960.70303 1.31
## bedrooms -37942.22923 5140.56726 -7.38
## sqft_living 174.31529 5.84404 29.83
## sqft_above -232.12312 9.01099 -25.76
## floors 14271.05585 3722.97985 3.83
## grade4 -35043.35004 235079.61029 -0.15
## grade5 9116.80210 231363.17493 0.04
## grade6 50285.38874 230969.99192 0.22
## grade7 107201.66962 230973.41942 0.46
## grade8 195978.96124 231016.35221 0.85
## grade9 342536.69129 231090.06081 1.48
## grade10 523989.90325 231190.83681 2.27
## grade11 751707.87220 231468.92597 3.25
## grade12 1141807.53068 232773.60628 4.91
## grade13 1933822.97431 241686.51914 8.00
## bedrooms:sqft_above 12.57344 2.48837 5.05
## sqft_living:sqft_above 0.02832 0.00169 16.75
## Pr(>|t|)
## (Intercept) 0.19171
## bedrooms 0.0000000000001629 ***
## sqft_living < 0.0000000000000002 ***
## sqft_above < 0.0000000000000002 ***
## floors 0.00013 ***
## grade4 0.88150
## grade5 0.96857
## grade6 0.82765
## grade7 0.64256
## grade8 0.39626
## grade9 0.13828
## grade10 0.02343 *
## grade11 0.00117 **
## grade12 0.0000009399814052 ***
## grade13 0.0000000000000013 ***
## bedrooms:sqft_above 0.0000004387237711 ***
## sqft_living:sqft_above < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 231000 on 21579 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.605
## F-statistic: 2.07e+03 on 16 and 21579 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above
## 8.75 11.67 22.55
## floors grade4 grade5
## 1.64 27.97 240.39
## grade6 grade7 grade8
## 1847.89 5251.23 4368.71
## grade9 grade10 grade11
## 2303.51 1077.80 393.79
## grade12 grade13 bedrooms:sqft_above
## 90.13 14.24 44.35
## sqft_living:sqft_above
## 23.68
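With an interaction in the model, the effect of one variable is no longer a single number. As a worked example, the sketch below takes the “bedrooms” and “bedrooms:sqft_above” coefficients from the final model summary above and computes the implied price change for one extra bedroom at several above-ground areas. The function name and the chosen areas are ours, for illustration only.

```r
# Coefficients copied from the final model summary above.
b_bedrooms            <- -37942.22923
b_bedrooms_sqft_above <- 12.57344

# With an interaction, the effect of one extra bedroom depends on sqft_above.
bedroom_effect <- function(sqft_above) {
  b_bedrooms + b_bedrooms_sqft_above * sqft_above
}

print(bedroom_effect(c(1000, 1560, 4000)))

# Above-ground area at which the bedroom effect changes sign.
break_even <- -b_bedrooms / b_bedrooms_sqft_above
print(break_even)
```

Under this model, an extra bedroom is associated with a lower price in houses with less above-ground area and a higher price only once “sqft_above” exceeds roughly 3,000 sqft, holding the other variables fixed.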
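Using the rounded coefficients from the final model summary above, the fitted model can be written as:
Price = 301536 + bedrooms × (-37942 + 12.57 × sqft_above) + sqft_living × (174.3 + 0.0283 × sqft_above) + sqft_above × (-232.1) + floors × 14271 + grade effect
Here, “grade effect” is the coefficient of the house’s grade level (zero for the baseline grade, 3).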
Problem: As the price histogram above is quite right-skewed, there are many outliers in the dataset whose price is very high. When we built the model, we did not exclude these outliers because we considered them important. As a result, our final model is also somewhat skewed: for low-priced houses, it may predict higher-than-normal prices, and for high-priced houses, lower-than-normal prices.